Extend repetition test to span across two files #349
Strange, the test succeeds:
which contradicts my tests, where in this case the URLs were checked and counted once per file. Did I get the syntax right for adding the two input files? Or was the issue fixed in the meantime, between my tests and this commit? 🤔 |
Okay, "great", now it fails as expected 🙂. |
Here is the part which should filter out cached links. For this to work, of course, it is mandatory that there is only a single
// Filter out already cached links (duplicates)
links.retain(|l| !self.cache.contains(&l.uri)); |
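For illustration, here is a minimal sketch of how such a `retain`-based filter only deduplicates across files when one shared cache sees every batch. The `Link` and `Collector` types and the `filter_new` method are hypothetical stand-ins, not the project's actual code; only the `uri` field and the `retain` idea come from the snippet above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a collected link; only `uri` comes from the snippet above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Link {
    pub uri: String,
}

/// Hypothetical collector that remembers every URI it has already returned.
#[derive(Default)]
pub struct Collector {
    cache: HashSet<String>,
}

impl Collector {
    /// Drops links whose URI was seen before (in an earlier batch or earlier
    /// in this batch) and records the remaining ones in the cache.
    pub fn filter_new(&mut self, mut links: Vec<Link>) -> Vec<Link> {
        // `HashSet::insert` returns `false` if the value was already present,
        // so a single pass both filters and records.
        links.retain(|l| self.cache.insert(l.uri.clone()));
        links
    }
}
```

If each input file were processed by its own `Collector` (and therefore its own cache), the same URI would pass the filter once per file, which matches the behaviour described above.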
I wonder why we even have that line at all. Given that |
Hmm, you mean identical URIs then have an identical hash (=key) so that duplicates shouldn't be possible? I also see only a single Ah wait, link checks start already (async) after one file has been processed, right? Is it possible that a link is checked before the duplicate has been parsed, and when being parsed a second time the cached result is overridden or so? |
In Otherwise, awesome, the cache now seems to work 👍. Probably |
Failing tests:
|
Okay, I think something has gone wrong. In @mre
|
I tried to fix the issue that each loop clears The whole issue with the cache, why it didn't work as expected, seems to be due to the race condition between multiple |
Adding the cache to the collector was a mistake. The collector should have a single responsibility: collecting links. If anything, we should have a cache for requests. There it would be a simple lookup of whether a URL was already requested, with the added benefit of being able to track the number of duplicate links in the future. In conclusion, I'd vote for removing the cache entirely for now and adding it back later. 😉 |
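A request-side cache along those lines could look roughly like the sketch below. `RequestCache`, `should_request` and `duplicates` are hypothetical names chosen for illustration; nothing here is taken from the actual code base.

```rust
use std::collections::HashMap;

/// Hypothetical request cache: maps each URI to the number of times it was seen.
#[derive(Default)]
pub struct RequestCache {
    seen: HashMap<String, usize>,
}

impl RequestCache {
    /// Records a sighting of `uri` and returns `true` only for the first one,
    /// i.e. when a request should actually be sent.
    pub fn should_request(&mut self, uri: &str) -> bool {
        let count = self.seen.entry(uri.to_owned()).or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Number of duplicate sightings of `uri` (every sighting after the first).
    pub fn duplicates(&self, uri: &str) -> usize {
        self.seen.get(uri).map_or(0, |n| n.saturating_sub(1))
    }
}
```

Because the lookup happens where requests are issued, the collector would not need to know about the cache at all.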
Just for my understanding:
So there needs to be one global cache, links pool or requests pool, and it shouldn't matter which one it is, as long as duplicates across all concurrent instances are filtered at some point, in theory the earlier the better (less overhead). I'm not sure why it doesn't work currently (aside from the currently wrong syntax 😉), so if you don't see an easy way to fix it the way it was intended initially, I agree to remove it for now, especially since I'm looking forward to a release with local files support 🙂. Wouldn't it be possible for all collect instances to feed a global links pool, a hash set so that duplicates are avoided in the first place, and to have all request handlers then pick directly from this global pool (locking/tagging the entry first so that it is not handled a second time)? That way we don't need a separate cache to check against. |
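The global-pool idea can be sketched with the standard library alone. This is only an illustration under assumptions: `LinkPool`, `offer`, `claim` and `collect_concurrently` are hypothetical names, plain threads stand in for the concurrent collect instances, and a real implementation would more likely use async tasks.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical global pool: `seen` rejects duplicates, `pending` holds unclaimed links.
#[derive(Default)]
pub struct LinkPool {
    seen: HashSet<String>,
    pending: Vec<String>,
}

impl LinkPool {
    /// Adds `uri` unless it was offered before; returns whether it was new.
    pub fn offer(&mut self, uri: &str) -> bool {
        let is_new = self.seen.insert(uri.to_owned());
        if is_new {
            self.pending.push(uri.to_owned());
        }
        is_new
    }

    /// Hands out one pending link; removing it ensures it is handled only once.
    pub fn claim(&mut self) -> Option<String> {
        self.pending.pop()
    }
}

/// Feeds the pool from one thread per input file, then drains it.
pub fn collect_concurrently(files: Vec<Vec<String>>) -> Vec<String> {
    let pool = Arc::new(Mutex::new(LinkPool::default()));
    let handles: Vec<_> = files
        .into_iter()
        .map(|links| {
            let pool = Arc::clone(&pool);
            thread::spawn(move || {
                for uri in links {
                    pool.lock().unwrap().offer(&uri);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let mut claimed = Vec::new();
    while let Some(uri) = pool.lock().unwrap().claim() {
        claimed.push(uri);
    }
    claimed.sort(); // thread scheduling makes the raw order nondeterministic
    claimed
}
```

Since every `offer` goes through the same mutex-guarded set, it does not matter which collect instance sees a URI first.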
Okay, how shall we proceed? The cache does not work reliably, but if a concurrency race condition really is the issue, then it should at least work partly when using multiple and larger inputs, and I do not see a general problem with filtering in the collector already, at least until we find time for a better implementation. I wonder whether #330 will help 🤔. We could probably leave things as they are for now and rebase the two-files test extension afterwards. If anyone knows how to fix the current state so that But otherwise it is trivial to remove the parts with the cache for now and revert the tests to check for filtered repetitions in a single file only. |
Signed-off-by: MichaIng <[email protected]>
I think there is an option to just add a |
Sounds like a more native solution 👍. Btw, shall I move the HTTP => HTTPS changes into a separate PR? |
Reqwest comes with its own request pool, so there's no need to add another layer of indirection. This also gets rid of a lot of allocs.
Signed-off-by: MichaIng <[email protected]>
I've removed caching across multiple files as discussed. |
How is the collector now tied to the clients? I had the impression that one collector iteration was tied to one client. Now that there is no client pool but a single instance with internal threading/pooling, might this even solve the problem of concurrent collector iterations? |
Not sure I understand. The collector and clients were always independent (?). The collector would return a Sorry, I'm pretty sure I misunderstood. 😅 |
Apart from that, this is ready to be merged from my point of view. |
Ah okay, was my misunderstanding then. I was just wondering where this Actually I don't understand where the concurrency of the link collection comes from, to me it looks like there is a single |
Yeah, something like that. TBH I don't really care, because the current cache impl is broken and too entangled with the collector. We'll find a better option, like using an XOR filter, which is very memory-efficient and fast. But that's for another PR. 😉 Will merge this one now unless anyone has any final comments. |
@MichaIng I've merged the changes now. Let's build a better caching layer in the future. As always, thanks for the initial PR and the feedback during the development. 😃 |
Meh, just tested |
Yes, I can confirm: on our website, with latest master, execution time rose from ~8 seconds to 2 minutes, and on the docs from 13 seconds to 13 minutes (!). That client pool definitely needs to stay until another way of achieving concurrency is found. |
Created a PR and will merge it when it gets green. That should take care of it. |
A while ago, caching was removed due to some issues (see #349). This is a new implementation with the following improvements:

* Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
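To make the two-column format concrete, here is a rough, standard-library-only sketch of a round trip between an in-memory cache and CSV text. It is an illustration under assumptions, not the implementation from this PR: a plain `HashMap` replaces DashMap to keep the example self-contained, `CacheStatus` is modelled as an enum although the description calls it a struct, the `Success`/`Fail` variants and the `to_csv`/`from_csv` helpers are hypothetical, and no CSV quoting is handled.

```rust
use std::collections::HashMap;

/// Hypothetical simplified status; the real CacheStatus may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheStatus {
    Success,
    Fail,
}

impl CacheStatus {
    fn as_str(self) -> &'static str {
        match self {
            CacheStatus::Success => "success",
            CacheStatus::Fail => "fail",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "success" => Some(CacheStatus::Success),
            "fail" => Some(CacheStatus::Fail),
            _ => None,
        }
    }
}

/// Serializes the cache as `uri,status` rows, sorted for stable output.
pub fn to_csv(cache: &HashMap<String, CacheStatus>) -> String {
    let mut rows: Vec<String> = cache
        .iter()
        .map(|(uri, status)| format!("{},{}", uri, status.as_str()))
        .collect();
    rows.sort();
    rows.join("\n")
}

/// Parses `uri,status` rows, skipping lines that do not match the format.
pub fn from_csv(data: &str) -> HashMap<String, CacheStatus> {
    data.lines()
        .filter_map(|line| {
            // Split at the last comma so that commas inside the URI survive.
            let (uri, status) = line.rsplit_once(',')?;
            Some((uri.to_owned(), CacheStatus::parse(status.trim())?))
        })
        .collect()
}
```

Keeping the serialized status separate from the richer in-memory status means the file format stays stable even if the error kinds change.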
A first step to address: #348